tutorials/003 - Amazon S3.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 - Amazon S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "* [1. CSV files](#1.-CSV-files)\n", "\t* [1.1 Writing CSV files](#1.1-Writing-CSV-files)\n", "\t* [1.2 Reading single CSV file](#1.2-Reading-single-CSV-file)\n", "\t* [1.3 Reading multiple CSV files](#1.3-Reading-multiple-CSV-files)\n", "\t\t* [1.3.1 Reading CSV by list](#1.3.1-Reading-CSV-by-list)\n", "\t\t* [1.3.2 Reading CSV by prefix](#1.3.2-Reading-CSV-by-prefix)\n", "* [2. JSON files](#2.-JSON-files)\n", "\t* [2.1 Writing JSON files](#2.1-Writing-JSON-files)\n", "\t* [2.2 Reading single JSON file](#2.2-Reading-single-JSON-file)\n", "\t* [2.3 Reading multiple JSON files](#2.3-Reading-multiple-JSON-files)\n", "\t\t* [2.3.1 Reading JSON by list](#2.3.1-Reading-JSON-by-list)\n", "\t\t* [2.3.2 Reading JSON by prefix](#2.3.2-Reading-JSON-by-prefix)\n", "* [3. Parquet files](#3.-Parquet-files)\n", "\t* [3.1 Writing Parquet files](#3.1-Writing-Parquet-files)\n", "\t* [3.2 Reading single Parquet file](#3.2-Reading-single-Parquet-file)\n", "\t* [3.3 Reading multiple Parquet files](#3.3-Reading-multiple-Parquet-files)\n", "\t\t* [3.3.1 Reading Parquet by list](#3.3.1-Reading-Parquet-by-list)\n", "\t\t* [3.3.2 Reading Parquet by prefix](#3.3.2-Reading-Parquet-by-prefix)\n", "* [4. Fixed-width formatted files (only read)](#4.-Fixed-width-formatted-files-(only-read))\n", "\t* [4.1 Reading single FWF file](#4.1-Reading-single-FWF-file)\n", "\t* [4.2 Reading multiple FWF files](#4.2-Reading-multiple-FWF-files)\n", "\t\t* [4.2.1 Reading FWF by list](#4.2.1-Reading-FWF-by-list)\n", "\t\t* [4.2.2 Reading FWF by prefix](#4.2.2-Reading-FWF-by-prefix)\n", "* [5. Excel files](#5.-Excel-files)\n", "\t* [5.1 Writing Excel file](#5.1-Writing-Excel-file)\n", "\t* [5.2 Reading Excel file](#5.2-Reading-Excel-file)\n", "* [6. Reading with lastModified filter](#6.-Reading-with-lastModified-filter)\n", "\t* [6.1 Define the Date time with UTC Timezone](#6.1-Define-the-Date-time-with-UTC-Timezone)\n", "\t* [6.2 Define the Date time and specify the Timezone](#6.2-Define-the-Date-time-and-specify-the-Timezone)\n", "\t* [6.3 Read json using the LastModified filters](#6.3-Read-json-using-the-LastModified-filters)\n", "* [7. Download Objects](#7.-Download-objects)\n", " * [7.1 Download object to a file path](#7.1-Download-object-to-a-file-path)\n", " * [7.2 Download object to a file-like object in binary mode](#7.2-Download-object-to-a-file-like-object-in-binary-mode)\n", "* [8. Upload Objects](#8.-Upload-objects)\n", " * [8.1 Upload object from a file path](#8.1-Upload-object-from-a-file-path)\n", " * [8.2 Upload object from a file-like object in binary mode](#8.2-Upload-object-from-a-file-like-object-in-binary-mode)\n", "* [9. Delete objects](#9.-Delete-objects)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "import boto3\n", "import pandas as pd\n", "import pytz\n", "\n", "import awswrangler as wr\n", "\n", "df1 = pd.DataFrame({\"id\": [1, 2], \"name\": [\"foo\", \"boo\"]})\n", "\n", "df2 = pd.DataFrame({\"id\": [3], \"name\": [\"bar\"]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enter your bucket name:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import getpass\n", "\n", "bucket = getpass.getpass()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. CSV files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Writing CSV files" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "path1 = f\"s3://{bucket}/csv/file1.csv\"\n", "path2 = f\"s3://{bucket}/csv/file2.csv\"\n", "\n", "wr.s3.to_csv(df1, path1, index=False)\n", "wr.s3.to_csv(df2, path2, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Reading single CSV file" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_csv([path1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Reading multiple CSV files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.1 Reading CSV by list" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "2 3 bar" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_csv([path1, path2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.2 Reading CSV by prefix" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "2 3 bar" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_csv(f\"s3://{bucket}/csv/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. JSON files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Writing JSON files" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['s3://woodadw-test/json/file2.json']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path1 = f\"s3://{bucket}/json/file1.json\"\n", "path2 = f\"s3://{bucket}/json/file2.json\"\n", "\n", "wr.s3.to_json(df1, path1)\n", "wr.s3.to_json(df2, path2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Reading single JSON file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_json([path1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Reading multiple JSON files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3.1 Reading JSON by list" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "0 3 bar" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_json([path1, path2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3.2 Reading JSON by prefix" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "0 3 bar" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_json(f\"s3://{bucket}/json/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Parquet files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more complex features releated to Parquet Dataset check the tutorial number 4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Writing Parquet files" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "path1 = f\"s3://{bucket}/parquet/file1.parquet\"\n", "path2 = f\"s3://{bucket}/parquet/file2.parquet\"\n", "\n", "wr.s3.to_parquet(df1, path1)\n", "wr.s3.to_parquet(df2, path2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Reading single Parquet file" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_parquet([path1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Reading multiple Parquet files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.1 Reading Parquet by list" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "2 3 bar" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_parquet([path1, path2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.2 Reading Parquet by prefix" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo\n", "2 3 bar" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_parquet(f\"s3://{bucket}/parquet/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Fixed-width formatted files (only read)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of today, Pandas doesn't implement a `to_fwf` functionality, so let's manually write two files:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "content = \"1 Herfelingen 27-12-18\\n2 Lambusart 14-06-18\\n3 Spormaggiore 15-04-18\"\n", "boto3.client(\"s3\").put_object(Body=content, Bucket=bucket, Key=\"fwf/file1.txt\")\n", "\n", "content = \"4 Buizingen 05-09-19\\n5 San Rafael 04-09-19\"\n", "boto3.client(\"s3\").put_object(Body=content, Bucket=bucket, Key=\"fwf/file2.txt\")\n", "\n", "path1 = f\"s3://{bucket}/fwf/file1.txt\"\n", "path2 = f\"s3://{bucket}/fwf/file2.txt\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Reading single FWF file" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " <th>date</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Herfelingen</td>\n", " <td>27-12-18</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Lambusart</td>\n", " <td>14-06-18</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Spormaggiore</td>\n", " <td>15-04-18</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name date\n", "0 1 Herfelingen 27-12-18\n", "1 2 Lambusart 14-06-18\n", "2 3 Spormaggiore 15-04-18" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_fwf([path1], names=[\"id\", \"name\", \"date\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Reading multiple FWF files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2.1 Reading FWF by list" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " <th>date</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Herfelingen</td>\n", " <td>27-12-18</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Lambusart</td>\n", " <td>14-06-18</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Spormaggiore</td>\n", " <td>15-04-18</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Buizingen</td>\n", " <td>05-09-19</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>San Rafael</td>\n", " <td>04-09-19</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name date\n", "0 1 Herfelingen 27-12-18\n", "1 2 Lambusart 14-06-18\n", "2 3 Spormaggiore 15-04-18\n", "3 4 Buizingen 05-09-19\n", "4 5 San Rafael 04-09-19" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_fwf([path1, path2], names=[\"id\", \"name\", \"date\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2.2 Reading FWF by prefix" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " <th>date</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Herfelingen</td>\n", " <td>27-12-18</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Lambusart</td>\n", " <td>14-06-18</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Spormaggiore</td>\n", " <td>15-04-18</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Buizingen</td>\n", " <td>05-09-19</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>San Rafael</td>\n", " <td>04-09-19</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name date\n", "0 1 Herfelingen 27-12-18\n", "1 2 Lambusart 14-06-18\n", "2 3 Spormaggiore 15-04-18\n", "3 4 Buizingen 05-09-19\n", "4 5 San Rafael 04-09-19" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_fwf(f\"s3://{bucket}/fwf/\", names=[\"id\", \"name\", \"date\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Excel files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1 Writing Excel file" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'s3://woodadw-test/file0.xlsx'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = f\"s3://{bucket}/file0.xlsx\"\n", "\n", "wr.s3.to_excel(df1, path, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 Reading Excel file" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.s3.read_excel(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Reading with lastModified filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the filter by LastModified Date.\n", "\n", "The filter needs to be specified as datime with time zone\n", "\n", "Internally the path needs to be listed, after that the filter is applied.\n", "\n", "The filter compare the s3 content with the variables lastModified_begin and lastModified_end\n", "\n", "https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1 Define the Date time with UTC Timezone" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "begin = datetime.strptime(\"20-07-31 20:30\", \"%y-%m-%d %H:%M\")\n", "end = datetime.strptime(\"21-07-31 20:30\", \"%y-%m-%d %H:%M\")\n", "\n", "begin_utc = pytz.utc.localize(begin)\n", "end_utc = pytz.utc.localize(end)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 Define the Date time and specify the Timezone" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "begin = datetime.strptime(\"20-07-31 20:30\", \"%y-%m-%d %H:%M\")\n", "end = datetime.strptime(\"21-07-31 20:30\", \"%y-%m-%d %H:%M\")\n", "\n", "timezone = pytz.timezone(\"America/Los_Angeles\")\n", "\n", "begin_Los_Angeles = timezone.localize(begin)\n", "end_Los_Angeles = timezone.localize(end)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3 Read json using the LastModified filters " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "wr.s3.read_fwf(\n", " f\"s3://{bucket}/fwf/\", names=[\"id\", \"name\", \"date\"], last_modified_begin=begin_utc, last_modified_end=end_utc\n", ")\n", "wr.s3.read_json(f\"s3://{bucket}/json/\", last_modified_begin=begin_utc, last_modified_end=end_utc)\n", "wr.s3.read_csv(f\"s3://{bucket}/csv/\", last_modified_begin=begin_utc, last_modified_end=end_utc)\n", "wr.s3.read_parquet(f\"s3://{bucket}/parquet/\", last_modified_begin=begin_utc, last_modified_end=end_utc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Download objects" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Objects can be downloaded from S3 using either a path to a local file or a file-like object in binary mode." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### 7.1 Download object to a file path" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "local_file_dir = getpass.getpass()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "\n", "path1 = f\"s3://{bucket}/csv/file1.csv\"\n", "local_file = os.path.join(local_file_dir, \"file1.csv\")\n", "wr.s3.download(path=path1, local_file=local_file)\n", "\n", "pd.read_csv(local_file)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### 7.2 Download object to a file-like object in binary mode" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 3 bar" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path2 = f\"s3://{bucket}/csv/file2.csv\"\n", "local_file = os.path.join(local_file_dir, \"file2.csv\")\n", "with open(local_file, mode=\"wb\") as local_f:\n", " wr.s3.download(path=path2, local_file=local_f)\n", "\n", "pd.read_csv(local_file)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 8. Upload objects" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Objects can be uploaded to S3 using either a path to a local file or a file-like object in binary mode." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### 8.1 Upload object from a file path" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>foo</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>boo</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 1 foo\n", "1 2 boo" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "local_file = os.path.join(local_file_dir, \"file1.csv\")\n", "wr.s3.upload(local_file=local_file, path=path1)\n", "\n", "wr.s3.read_csv(path1)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### 8.2 Upload object from a file-like object in binary mode" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>name</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>bar</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " id name\n", "0 3 bar" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "local_file = os.path.join(local_file_dir, \"file2.csv\")\n", "with open(local_file, \"rb\") as local_f:\n", " wr.s3.upload(local_file=local_f, path=path2)\n", "\n", "wr.s3.read_csv(path2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Delete objects" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "wr.s3.delete_objects(f\"s3://{bucket}/\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.14", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.14" } }, "nbformat": 4, "nbformat_minor": 4 }

tutorials/003 - Amazon S3.ipynb (1,766 lines of code) (raw):